fix(rbac): Add missing agentruntimes permissions to ClusterRole #253

Open
Alan-Cha wants to merge 1 commit into main from fix/rbac-agentruntimes-permissions

@Alan-Cha
Contributor

Problem

The kagenti-operator is deployed with insufficient RBAC permissions, causing continuous errors in production:

agentruntimes.agent.kagenti.dev is forbidden: User "system:serviceaccount:kagenti-operator-system:controller-manager" 
cannot list resource "agentruntimes" in API group "agent.kagenti.dev" at the cluster scope

Logs showing the issue:

W0327 18:56:01.604062       1 reflector.go:569] failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden
E0327 18:56:01.604188       1 reflector.go:166] "Unhandled Error" err="Failed to watch *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden"

This error repeats continuously (exponential backoff), filling operator logs and preventing the AgentRuntimeReconciler from functioning.

Root Cause

Mismatch between code and Helm chart:

  1. Code declares required RBAC in internal/controller/agentruntime_controller.go:66-69:

    // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes,verbs=get;list;watch;create;update;patch;delete
    // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes/status,verbs=get;update;patch
    // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes/finalizers,verbs=update
  2. Controller is always registered in cmd/main.go:323-330

  3. Helm chart ClusterRole is missing these permissions in charts/kagenti-operator/templates/rbac/role.yaml

Solution

Add the missing agentruntimes permissions to the ClusterRole Helm template to match the controller's kubebuilder RBAC annotations.

Changes:

  • Added agentruntimes resource permissions (get, list, watch, create, update, patch, delete)
  • Added agentruntimes/status subresource permissions (get, update, patch)
  • Added agentruntimes/finalizers subresource permissions (update)

Impact

Before fix:

  • ❌ Permission errors fill operator logs
  • ❌ AgentRuntime controller cannot reconcile
  • ❌ Per-workload identity configuration non-functional
  • ❌ Per-workload observability overrides non-functional

After fix:

  • ✅ Permission errors eliminated
  • ✅ AgentRuntime controller can list/watch/reconcile CRDs
  • ✅ Per-workload configuration features functional
  • ✅ Clean operator logs

Testing

Deployed operator with this fix in kind cluster:

  1. Permission errors stopped immediately after ClusterRole update
  2. AgentRuntime controller successfully lists/watches CRDs
  3. No regressions in other controllers (AgentCard, ClientRegistration, etc.)

Type of Change

  • Bug fix (fixes an issue in existing functionality)
  • New feature
  • Breaking change

Checklist

  • Code follows project style guidelines
  • Changes have been tested locally
  • Commit message follows conventional commits
  • DCO sign-off included
  • Documentation updated (if needed)

Related Issues

This is a pre-existing bug affecting all deployments of kagenti-operator. No specific issue was filed; the bug was discovered during E2E testing of PR #247.

AgentRuntimeReconciler has been deployed in production without the
necessary RBAC permissions, causing continuous permission errors in
operator logs.

## Problem

The operator's ServiceAccount cannot list/watch AgentRuntime CRDs:

```
agentruntimes.agent.kagenti.dev is forbidden: User
"system:serviceaccount:kagenti-operator-system:controller-manager"
cannot list resource "agentruntimes" in API group "agent.kagenti.dev"
at the cluster scope
```

This error repeats continuously with exponential backoff, filling logs
and preventing AgentRuntime reconciliation.

## Root Cause

1. AgentRuntimeReconciler is always registered (cmd/main.go:323-330)
2. Controller declares required RBAC in code annotations:
   ```go
   // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes,verbs=get;list;watch;create;update;patch;delete
   // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes/status,verbs=get;update;patch
   // +kubebuilder:rbac:groups=agent.kagenti.dev,resources=agentruntimes/finalizers,verbs=update
   ```
3. Helm chart ClusterRole template is missing these permissions

## Solution

Add agentruntimes permissions to charts/kagenti-operator/templates/rbac/role.yaml
matching the kubebuilder RBAC annotations in agentruntime_controller.go.

## Impact

- Fixes permission errors in operator logs
- Enables AgentRuntime controller to function correctly
- Allows per-workload identity/observability configuration

## Testing

Deployed operator with fix in kind cluster:
- Permission errors stopped immediately
- AgentRuntime controller can now list/watch CRDs
- No regressions in other controllers

Fixes a pre-existing bug affecting all deployments.

Signed-off-by: Alan Cha <Alan.cha1@ibm.com>

@Alan-Cha requested a review from a team as a code owner on March 27, 2026, 19:32

@Alan-Cha
Contributor Author

Full Error Logs from Operator

These errors repeat continuously in the kagenti-operator logs (exponential backoff):

W0327 18:56:01.604062       1 reflector.go:569] pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User "system:serviceaccount:kagenti-operator-system:controller-manager" cannot list resource "agentruntimes" in API group "agent.kagenti.dev" at the cluster scope

E0327 18:56:01.604188       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: Failed to watch *v1alpha1.AgentRuntime: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User \"system:serviceaccount:kagenti-operator-system:controller-manager\" cannot list resource \"agentruntimes\" in API group \"agent.kagenti.dev\" at the cluster scope" logger="UnhandledError"

W0327 18:56:02.763297       1 reflector.go:569] pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User "system:serviceaccount:kagenti-operator-system:controller-manager" cannot list resource "agentruntimes" in API group "agent.kagenti.dev" at the cluster scope

E0327 18:56:02.763535       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: Failed to watch *v1alpha1.AgentRuntime: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User \"system:serviceaccount:kagenti-operator-system:controller-manager\" cannot list resource \"agentruntimes\" in API group \"agent.kagenti.dev\" at the cluster scope" logger="UnhandledError"

W0327 18:56:05.376465       1 reflector.go:569] pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User "system:serviceaccount:kagenti-operator-system:controller-manager" cannot list resource "agentruntimes" in API group "agent.kagenti.dev" at the cluster scope

E0327 18:56:05.376588       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/client-go@v0.32.0/tools/cache/reflector.go:251: Failed to watch *v1alpha1.AgentRuntime: failed to list *v1alpha1.AgentRuntime: agentruntimes.agent.kagenti.dev is forbidden: User \"system:serviceaccount:kagenti-operator-system:controller-manager\" cannot list resource \"agentruntimes\" in API group \"agent.kagenti.dev\" at the cluster scope" logger="UnhandledError"

How to reproduce:

  1. Deploy kagenti-operator from main branch
  2. Check operator logs: kubectl logs -n kagenti-operator-system deployment/kagenti-controller-manager
  3. Observe continuous RBAC permission errors

Verification after fix:

  1. Applied this RBAC fix to cluster
  2. Operator logs immediately stopped showing permission errors
  3. AgentRuntime controller can now successfully list/watch CRDs

Collaborator

@cwiklik left a comment

Review Summary

Correct fix for the missing agentruntimes RBAC permissions — the added rules match the kubebuilder markers exactly and the PR description is thorough with clear root cause analysis.

Overlap with #249: PR #249 (already approved) does a comprehensive alignment of this same file, adding agentruntimes (same fix) and removing 79 lines of over-provisioned rules (secrets, CRDs, webhooks, RBAC management, deprecated extensions API group). These two PRs will have merge conflicts on charts/kagenti-operator/templates/rbac/role.yaml. Recommend coordinating merge order:

  • If #249 merges first, this PR becomes redundant
  • If this merges first, #249 needs a rebase

Areas reviewed: Helm/K8s RBAC
Commits: 1 commit, signed-off: yes
CI status: All 14 checks passing (including E2E)

verbs:
- create
- delete
- get
Collaborator

suggestion (coordination): This exact change is also included in PR #249, which does a broader RBAC cleanup aligning the entire Helm ClusterRole with config/rbac/role.yaml. PR #249 adds agentruntimes (same rules as here) plus removes ~79 lines of over-provisioned permissions the operator doesn't use (secrets, CRDs, webhooks, RBAC, deprecated extensions API group, etc.).

These two PRs will conflict on this file. If #249 merges first, this PR is fully superseded. Worth coordinating merge order with @ChristianZaccaria.

Contributor Author

Perhaps it makes sense to merge #249 as the changes are more extensive. We can close this after that one is merged.

@Alan-Cha
Contributor Author

Severity Clarification

This bug completely breaks the AgentRuntime feature, which is documented as "the declarative way to enroll a workload into the Kagenti platform" (docs/architecture.md).

Impact on Users

Users creating AgentRuntime resources expecting agent enrollment will see:

  • ❌ AgentRuntime CR stuck in Pending phase
  • ❌ Target workload never receives kagenti.io/type label
  • ❌ Webhook never injects sidecars (no label to trigger on)
  • ❌ Per-agent identity/observability config never applied
  • ❌ Config changes never trigger rolling updates

Workaround: Users must manually add kagenti.io/type: agent labels to Deployments, but this bypasses the entire AgentRuntime configuration system.

Root Cause

AgentRuntime was added in commit aeb0189 (March 10, 2026) without the corresponding RBAC permissions. It was released in v0.2.0-alpha.22 with this bug present from day one.

Recommendation

This should be treated as a P0 bug fix for the AgentRuntime feature. All v0.2.0-alpha.* releases with AgentRuntime have this bug.
